# For Seurat objectlibrary(Seurat)# For alternative DE methods:library(MAST)# For parallelizationlibrary(BiocParallel)# For data wrangling, plotting, etc.library(tidyverse) library(cowplot)library(dplyr)library(ggplot2)library(EnhancedVolcano)
Code
proj_dir <-"/Users/nami/Desktop/sterile_inflammation/"out_data_dir <-"/Users/nami/Desktop/sterile_inflammation/output/data/04_major_celltype_annot/"plots_dir <-"/Users/nami/Desktop/sterile_inflammation/output/plots/04_major_celltype_annot/"sobj_dir <-"/Users/nami/Desktop/sterile_inflammation/output/processed/"source(paste0(proj_dir,"/functions/universal.R"))load(paste0(sobj_dir,"03.2_SI_clean_sobj.Robj"))DefaultAssay(seurat_obj) <-"RNA"library(future)oopts <-options(future.globals.maxSize =20.0*1e9) on.exit(options(oopts))f <-future({ expr }) ## Launch a future with large objects
Introduction
Based on the meeting on 03.03.25, we decided to go with SCT and res 0.2. In this document, we will identify marker genes for the seurat_clusterss in our seurat object, and identify major cell types. First we will identify marker genes.
Code
res <-0.2Idents(seurat_obj) <-paste0("SCT_snn_res.", res)DimPlot(seurat_obj, label =TRUE) +NoAxes() +NoLegend()
There was some confusion regarding performing DE analysis after SCTransform. In this context, I found this (https://github.com/satijalab/seurat/issues/2180 and https://github.com/satijalab/seurat/issues/2115):
“The scale.data (in SCTransform slot) is the pearson residuals that come out of regularized NB regression. The counts and data slot transforms these values back into integer values (stored in counts), and then performs a log-transformation (stored in data).
SCTransform (by default) only stores pearson residuals (scale.data) for 3,000 variable features, to save memory. When possible, we try to perform operations directly on the Pearson residuals themselves. However, these values are not sparse (contain exclusively non-zero elements), so they take a lot of memory to store, and as a result we don’t compute them for all genes by default. Unless you change the defaults in SCTransform, performing DE on the scale.data slot would only test differences in variable genes. We also find that pearson residuals are challenging to visualize/interpret on either Feature or Violin plots.So performing DE on the scale.data slot of this assay means you are only testing 3,000 genes. Performing DE on the RNA assay will test all genes.”
Therefore, I have changed the default assay to RNA for marker gene analysis. This slot was already normalized and scaled for downstream analysis in the previous document (03.2_clustering).
Marker genes
I will find top 20 and top 100 marker genes at res 0.2. I also pseudobulked each cluster and found the top 100 highly expressed genes for it. All these csv files are in the marker folder here: https://drive.google.com/drive/folders/1XsziJH3IJk4EJtlv9c2mQ86kQQ04mc0S
Code
res <-0.2Idents(seurat_obj) <-paste0("SCT_snn_res.", res)degs_0.2<-FindAllMarkers(seurat_obj, only.pos =TRUE, assay ="RNA",min.pct =0.25, logfc.threshold =0.1)# Extract top 20 genes for each clustertop_genes <- degs_0.2%>%group_by(cluster) %>%arrange(desc(avg_log2FC)) %>%slice_head(n =20)# Save to CSV filewrite.csv(top_genes, paste0(out_data_dir,tag,"_markers_res0.2.csv"), row.names =FALSE)# Extract top 100 genes for each clustertop_genes <- degs_0.2%>%group_by(cluster) %>%arrange(desc(avg_log2FC)) %>%slice_head(n =100)# Save to CSV filewrite.csv(top_genes, paste0(out_data_dir,tag,"_top_100_markers_res0.2.csv"), row.names =FALSE)res <-0.4Idents(seurat_obj) <-paste0("SCT_snn_res.", res)degs_0.4<-FindAllMarkers(seurat_obj, only.pos =TRUE, assay ="RNA",min.pct =0.25, logfc.threshold =0.1)# Extract top 20 genes for each clustertop_genes <- degs_0.4%>%group_by(cluster) %>%arrange(desc(avg_log2FC)) %>%slice_head(n =20)# Save to CSV filewrite.csv(top_genes, paste0(out_data_dir,tag,"_markers_res0.4.csv"), row.names =FALSE)# Calculate average expression per gene for each cluster using the normalized dataavg_exp <-AverageExpression(seurat_obj, assays ="RNA", slot ="data")# Extract the RNA matrix (genes as rows and clusters as columns)avg_matrix <- avg_exp$RNA# Create an empty list to store the results for each clusterresults_list <-list()# Loop over each cluster to sort genes by average expression and store the top 100 genesfor (cluster incolnames(avg_matrix)) {# Extract the average expression values for the current cluster cluster_exp <- avg_matrix[, cluster]# Sort the expression values in decreasing order sorted_exp <-sort(cluster_exp, decreasing =TRUE)# Get the top 100 genes only top100 <-head(sorted_exp, 100)# Create a data frame with gene names, their average expression values, and cluster info df <-data.frame(Gene =names(top100),AverageExpression = top100,cluster = cluster,row.names =NULL )# Append the data frame to the list results_list[[cluster]] <- df}# Combine the results from all clusters into one data frameall_results <-do.call(rbind, results_list)# Write the combined data frame to a CSV filewrite.csv(all_results, file =paste0(out_data_dir,tag,"_top100__highly_exp_genes_sct_0.2.csv", row.names =FALSE))
Feature Plots
In order to find the major cell types, next we looked at some canonical markers that were also present in the DEGs lists above.
Since I did not see any clusters showing pericyte markers, and pericytes were sorted before scRNA-seq, I decided to go back and see if I lost them during QC. This is before any quality filtering.